Because of the design of attention-based algorithms, most social media users experience echo chambers.
This limits their ability to see the complete picture of public opinion (and narrows individual opinion).
Is there a way to quickly access a different side of the story?
This project aims to take a word, trend, topic or hashtag, plus a sentiment, as input. The output is a summarised understanding of the key points. It is essentially a Google search for public opinion, which will hopefully let people access a greater scope of opinions quickly.
Twitter is the home of contemporary public opinion, and therefore the perfect place to start.
Let's look at some data to get a better understanding.
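import tweepy
import pandas as pd
print("You got this!")
# consumer_key, consumer_secret, access_token and access_token_secret are the
# Twitter API credentials; they must be defined before this cell runs.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
# wait_on_rate_limit=True makes Tweepy sleep instead of failing when the rate limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True)
# count = 1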
You got this!
Let's see a live example.
Let's use a simple example: Licorice Pizza, a recently released, award-nominated but highly controversial movie.
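tweets = []
# `since` is not a documented argument of search_tweets, so the start date is
# given with the `since:` search operator inside the query string instead.
for tweet in tweepy.Cursor(api.search_tweets, q="#LicoricePizza since:2022-01-28", count=10, lang="en").items(200):
    # print(count)
    # count += 1
    try:
        # Keep only the fields needed for the dataframe built below
        data = [tweet.created_at, tweet.id, tweet.text, tweet.retweet_count, tweet.favorite_count, tweet.lang]
        data = tuple(data)
        tweets.append(data)
    except tweepy.TweepyException as e:  # TweepError was renamed in Tweepy 4
        print(e)
        continue
    except StopIteration:
        break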
# df = pd.DataFrame(tweets, columns = ['created_at','tweet_id', 'tweet_text', "retweet_count", "favorite_count", "lang"])
# """Add the path to the folder you want to save the CSV file in as well as what you want the CSV file to be named inside the single quotations"""
# df.to_csv(path_or_buf = '/Users/caselyhayford/Desktop/Twitter Experiments/Tweets.csv/', index=False)
Just some of the query possibilities:
df = pd.DataFrame(tweets, columns = ['created_at','tweet_id', 'tweet_text', "retweet_count", "favorite_count", "lang"])
df.shape
(200, 6)
df.head(10)
| | created_at | tweet_id | tweet_text | retweet_count | favorite_count | lang |
|---|---|---|---|---|---|---|
| 0 | 2022-02-19 11:20:56+00:00 | 1494995410510823424 | A 15-year-old falling in love with a 25-year-o... | 0 | 0 | en |
| 1 | 2022-02-19 11:20:47+00:00 | 1494995374364405760 | "Don't be creepy," says the 25 year old to a 1... | 0 | 0 | en |
| 2 | 2022-02-19 11:11:40+00:00 | 1494993078721159169 | RT @universaluk: The nominations are in. #Lico... | 2 | 0 | en |
| 3 | 2022-02-19 11:08:21+00:00 | 1494992245241442304 | #LicoricePizza is too good | 0 | 0 | en |
| 4 | 2022-02-19 10:59:34+00:00 | 1494990035552112642 | RT @cineastmemes: Finally it's now available i... | 1 | 0 | en |
| 5 | 2022-02-19 10:34:34+00:00 | 1494983743538630658 | now watching #LicoricePizza FINALLY!!!! https:... | 0 | 1 | en |
| 6 | 2022-02-19 10:08:17+00:00 | 1494977127514447874 | #LicoricePizza is now on vod I repeat #Licoric... | 0 | 0 | en |
| 7 | 2022-02-19 10:02:28+00:00 | 1494975663542915075 | Finally it's now available in our channel the ... | 1 | 2 | en |
| 8 | 2022-02-19 10:01:39+00:00 | 1494975461515923460 | watching #LicoricePizza and there’s an actual ... | 0 | 0 | en |
| 9 | 2022-02-19 09:57:46+00:00 | 1494974480262729729 | Finallllllly #LicoricePizza 😌😌 https://t.co/Dn... | 0 | 0 | en |
print(df["tweet_text"][0])
A 15-year-old falling in love with a 25-year-old should not be normalized or romanticized regardless of gender. #LicoricePizza
This tweet expresses a negative sentiment, highlighting the age-gap issue.
print(df["tweet_text"][199])
What a fantastic soundtrack (lovely cover too)! Discovered new favourites thanks to Paul Thomas Anderson. Can't wai… https://t.co/mguHg4buyL
print(df["tweet_text"][3])
#LicoricePizza is too good
These are some positive tweets, talking about the film's quality and the amazing soundtrack.
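The raw tweets above carry retweet prefixes, shortened links, hashtags and truncation ellipses, all of which would get in the way of sentiment analysis. Below is a minimal cleaning sketch using only the standard library. The helper name `clean_tweet` and the sample strings are made up for illustration and are not part of the notebook; a real pipeline would more likely use NLTK or spaCy.

```python
import re

# Hypothetical helper for illustration; the patterns are assumptions about
# what counts as noise in a tweet, not a definitive cleaning recipe.
RT_PREFIX_RE = re.compile(r"^RT\s+@\w+:\s*")
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Return a lowercased tweet with retweet prefix, links and mentions removed."""
    text = RT_PREFIX_RE.sub("", text)   # drop a leading "RT @user: "
    text = URL_RE.sub("", text)         # drop links such as https://t.co/...
    text = MENTION_RE.sub("", text)     # drop any remaining @mentions
    text = text.replace("#", "")        # keep the hashtag word, drop the symbol
    text = text.replace("…", "")        # drop the truncation ellipsis
    return WHITESPACE_RE.sub(" ", text).strip().lower()


# Invented sample tweets, shaped like the ones in the dataframe.
samples = [
    "#LicoricePizza is too good",
    "RT @example_user: Great film #LicoricePizza",
    "Loved the soundtrack… https://t.co/example",
]
cleaned = [clean_tweet(s) for s in samples]
print(cleaned)
```

With a dataframe like the one above, the same function could be applied with `df["tweet_text"].apply(clean_tweet)`.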
Approximate Steps:
Scrape the corpus (dataframe) from Twitter.
Clean the text data (NLTK or spaCy).
Run binary sentiment classification to split the data into positive and negative (possibly also neutral). Here we can either train our own model (risky, as it is hard to generalize to arbitrary tweets without a huge amount of training data) or use transfer learning with a pre-trained model. Say hello to Hugging Face!
Now we have our corpus divided by sentiment. Taking the subset of the corpus with the sentiment selected by the user, we use Doc2Vec from gensim (the same idea as word2vec, but for whole documents, here tweets) to embed each tweet. We can then apply deep learning (an LSTM or CNN) to group the tweets into clusters based on patterns.
We can visualize these clusters as output while supplying the central tweets of each cluster, showing the core positions of the sentiment-based tweets.
Or a WordCloud?
For each cluster (topic) of tweets, we generate a summary tweet, written by a GAN model.
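The "central tweet" idea from the steps above can be sketched without any of the heavier tooling. This example assumes each tweet already has an embedding vector and a cluster label (however those end up being produced), and it assumes "central" means closest to the cluster centroid by Euclidean distance. The function names, toy vectors and texts are all invented for illustration.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def central_tweets(texts, vectors, labels):
    """Map each cluster label to the tweet whose vector is nearest its centroid."""
    groups = defaultdict(list)
    for text, vec, label in zip(texts, vectors, labels):
        groups[label].append((text, vec))

    result = {}
    for label, members in groups.items():
        centre = centroid([vec for _, vec in members])
        # math.dist computes the Euclidean distance (Python 3.8+)
        nearest = min(members, key=lambda member: math.dist(member[1], centre))
        result[label] = nearest[0]
    return result


# Toy 2-D "embeddings" and cluster labels, invented for illustration only.
texts = [
    "age gap is a problem",
    "the age gap felt wrong",
    "uncomfortable age gap plot",
    "great soundtrack",
    "loved the music",
    "fantastic cover",
]
vectors = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [9.0, 10.0], [10.0, 10.0], [12.0, 10.0]]
labels = [0, 0, 0, 1, 1, 1]

result = central_tweets(texts, vectors, labels)
print(result)
```

Swapping Euclidean distance for cosine similarity would be a reasonable alternative for text embeddings; the choice here is only to keep the sketch dependency-free.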